Red Wine Quality Exploration by Pengwei

I used to buy red wine from the liquor store a few years ago and enjoyed it when I was alone. I usually bought red wine made in France since I prefer the taste of the wine made there. But I don’t know which features determine the quality of red wine. By using data science techniques we can analyze which features may lead to the best quality of red wine.

This report is about red wine quality (https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt). We explore a dataset of 1599 red wines with 12 attributes: 11 chemical variables of the wine plus a quality rating. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). The attributes of red wine are as follows:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output attribute (based on sensory data): 12 - quality (score between 0 and 10)

Univariate Plots Section

## [1] 1599
## [1] 13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Our dataset consists of 1599 observations and 13 columns: a row index X, the 11 chemical attributes and the quality score.
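The inspection output above could come from calls like the ones below. This is a minimal sketch, not necessarily the code used for this report: it rebuilds only the six rows printed by the head output, and the name `redwine_head` and the file name in the comment are assumptions. The column X appears to be a plain row index (1, 2, 3, ...), so the sketch drops it before analysis.

```r
# The six rows printed above, typed in by hand for illustration.
# The full data would normally be read with something like
# read.csv("wineQualityReds.csv"); that file name is an assumption.
redwine_head <- data.frame(
  X = 1:6,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8),
  chlorides = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075),
  free.sulfur.dioxide = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40),
  density = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978),
  pH = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L)
)

dim(redwine_head)    # number of rows and columns
str(redwine_head)    # type of each column
names(redwine_head)  # column names

# X only numbers the rows, so remove it before any analysis
wine <- subset(redwine_head, select = -X)
ncol(wine)
```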

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

All the wine quality scores are between 3 and 8. No wine scored 0, 1, 2, 9 or 10, so the scores cover only a small part of the possible range, and most wines scored 5 or 6.

Next we look at the distribution of each numeric attribute, such as fixed.acidity, volatile.acidity and so on.
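Because the frequency table above lists every score, the quality column can be rebuilt from it and the summary statistics checked. The sketch below does that with base R; the object names are only illustrative.

```r
# Counts per quality score, copied from the table above
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Rebuild the full quality vector: each score repeated by its count
quality <- rep(as.integer(names(counts)), times = counts)

length(quality)   # number of wines
summary(quality)  # min, quartiles, median, mean, max
table(quality)    # same frequency table as above

# Percentage of wines at each score
round(100 * prop.table(table(quality)), 1)
```

Scores 5 and 6 together account for 1319 of the 1599 wines, which is why the distribution looks so concentrated.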

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The lowest fixed acidity is 4.6 and the highest is 15.9. Here I plot the main body of the fixed acidity values. The distribution appears to be positively skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The mean of volatile acidity is 0.5278 and the median is 0.52. The volatile acidity distribution appears to be positively skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##  132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##   30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
## 0.75 0.76 0.78 0.79    1 
##    1    3    1    1    1

The shape of the citric acid distribution is not clear, so I use a log scale on the x axis. There is a large number of wines with 0 citric acid (132 wines) and with 0.49 citric acid (68 wines).
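One caveat with a log scale here is that the logarithm of 0 is not finite, so wines with 0 citric acid cannot be placed on a log axis. The sketch below shows the effect on the first ten citric.acid values from the structure output above, and two transforms that keep zeros (`log1p` and `sqrt`). These alternatives are suggestions, not what was used for the plots in this report.

```r
# First ten citric.acid values, as shown in the str() output above
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# log10(0) is -Inf, so zero values fall off a log-scaled axis
log_citric <- log10(citric)
n_dropped <- sum(is.infinite(log_citric))
n_dropped

# Transforms that are defined at zero
log1p_citric <- log1p(citric)  # log(1 + x)
sqrt_citric <- sqrt(citric)
all(is.finite(log1p_citric))
```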

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The mean of residual sugar is 2.539 and the median is 2.2. The distribution appears to be positively skewed.
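A mean above the median is a quick hint of positive skew, and a moment-based skewness coefficient makes the check explicit. The helper below is a small sketch (the function name `sample_skewness` is ours, not from any package), applied to the first ten residual.sugar values from the structure output.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# First ten residual.sugar values, as shown in the str() output above
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

sample_skewness(residual_sugar)                # positive for a right tail
mean(residual_sugar) > median(residual_sugar)  # the mean-above-median hint
```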

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The mean of chlorides is 0.08747 and the median is 0.079. The distribution appears to be positively skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The mean of free sulfur dioxide is 15.87 and the median is 14. The min of free sulfur dioxide is 1 and the max is 72. The distribution appears to be positively skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The mean of total sulfur dioxide is 46.47 and the median is 38. The min of total sulfur dioxide is 6 and the max is 289. The distribution appears to be positively skewed.

The shapes of the free sulfur dioxide and total sulfur dioxide distributions seem correlated. We are not sure whether the amount of free sulfur dioxide is a fixed proportion of the amount of total sulfur dioxide. So we plot the rate free.sulfur.dioxide/total.sulfur.dioxide.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02273 0.25926 0.37500 0.38231 0.48485 0.85714

The sulfur dioxide rate, which is free sulfur dioxide divided by total sulfur dioxide, is not constant. The min of the rate is 0.0227 and the max is 0.857. The distribution of the rate looks roughly normal. So the rate alone does not settle the relation between the amount of free sulfur dioxide and the amount of total sulfur dioxide.
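The rate described above can be derived in one line. The sketch below uses the first ten free.sulfur.dioxide and total.sulfur.dioxide values from the structure output; the variable names are illustrative. Since free SO2 is one part of total SO2 according to the attribute descriptions, the rate has to lie between 0 and 1, and a rate that varies from wine to wine does not by itself rule out a correlation between the two amounts.

```r
# First ten values of each column, as shown in the str() output above
free_so2 <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Share of the total sulfur dioxide that is in free form
so2_rate <- free_so2 / total_so2
summary(so2_rate)

# Sanity check: free SO2 can never exceed total SO2
all(free_so2 <= total_so2)

# The two amounts can still move together even if the rate varies
cor(free_so2, total_so2)
```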

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

The mean of density is 0.9967 and the median is 0.9968. The min of density is 0.9901 and the max is 1.0037. The distribution looks approximately normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The mean of pH is 3.311 and the median is 3.310. The min of pH is 2.74 and the max is 4.01. The distribution looks approximately normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The sulphates distribution appears to be positively skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The mean of alcohol is 10.42 and the median is 10.20. The min of alcohol is 8.4 and the max is 14.9. The shapes of the distribution and of the log-scaled distribution are not clear. These univariate plots alone cannot show whether alcohol relates to the quality of red wine; that relation is examined in the bivariate section.

Univariate Analysis

What is the structure of your dataset?

There are 1599 red wines in the dataset with 11 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol), plus the quality score. All the features are of type num, while quality is of type int. We also have the following observations:

  • Most quality scores are 5, 6 or 7.
  • The median fixed acidity is 7.9.
  • Most wines have volatile acidity in the range 0.12 to 0.9.
  • About 75% of wines have residual sugar of at most 2.6.
  • The median quality score is 6. There is no quality score less than 3 or more than 8.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the red wine dataset is quality. I would like to find how the other features, such as fixed acidity, alcohol and so on, correlate with quality. We’d like to find which features are best for predicting the quality of a red wine.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Many features may contribute to the quality of red wines: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates. At this stage we are not sure which feature contributes most to quality. Based on the univariate plots alone, we suspect that features such as alcohol and citric acid may not contribute to the quality of red wine.

Did you create any new variables from existing variables in the dataset?

We created the sulfur dioxide rate, which is the ratio of free sulfur dioxide to total sulfur dioxide. The shapes of the two distributions seem similar, so we were interested in whether there is a close relation between the two attributes. But the sulfur dioxide rate is not constant: its distribution looks roughly normal with mean 0.382. It is hard to draw any conclusion about the relation between free sulfur dioxide and total sulfur dioxide from the rate alone.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I log-transformed citric.acid and alcohol since the shapes of these two distributions are not clear. But even after the log transform, I still cannot get a clear distribution shape for citric.acid and alcohol. Features such as citric.acid and alcohol may not correlate strongly with the quality of red wines.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                 1.00            -0.26        0.67
## volatile.acidity             -0.26             1.00       -0.55
## citric.acid                   0.67            -0.55        1.00
## residual.sugar                0.11             0.00        0.14
## chlorides                     0.09             0.06        0.20
## free.sulfur.dioxide          -0.15            -0.01       -0.06
## total.sulfur.dioxide         -0.11             0.08        0.04
## density                       0.67             0.02        0.36
## pH                           -0.68             0.23       -0.54
## sulphates                     0.18            -0.26        0.31
## alcohol                      -0.06            -0.20        0.11
## quality                       0.12            -0.39        0.23
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                  0.11      0.09               -0.15
## volatile.acidity               0.00      0.06               -0.01
## citric.acid                    0.14      0.20               -0.06
## residual.sugar                 1.00      0.06                0.19
## chlorides                      0.06      1.00                0.01
## free.sulfur.dioxide            0.19      0.01                1.00
## total.sulfur.dioxide           0.20      0.05                0.67
## density                        0.36      0.20               -0.02
## pH                            -0.09     -0.27                0.07
## sulphates                      0.01      0.37                0.05
## alcohol                        0.04     -0.22               -0.07
## quality                        0.01     -0.13               -0.05
##                      total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                       -0.11    0.67 -0.68      0.18   -0.06
## volatile.acidity                     0.08    0.02  0.23     -0.26   -0.20
## citric.acid                          0.04    0.36 -0.54      0.31    0.11
## residual.sugar                       0.20    0.36 -0.09      0.01    0.04
## chlorides                            0.05    0.20 -0.27      0.37   -0.22
## free.sulfur.dioxide                  0.67   -0.02  0.07      0.05   -0.07
## total.sulfur.dioxide                 1.00    0.07 -0.07      0.04   -0.21
## density                              0.07    1.00 -0.34      0.15   -0.50
## pH                                  -0.07   -0.34  1.00     -0.20    0.21
## sulphates                            0.04    0.15 -0.20      1.00    0.09
## alcohol                             -0.21   -0.50  0.21      0.09    1.00
## quality                             -0.19   -0.17 -0.06      0.25    0.48
##                      quality
## fixed.acidity           0.12
## volatile.acidity       -0.39
## citric.acid             0.23
## residual.sugar          0.01
## chlorides              -0.13
## free.sulfur.dioxide    -0.05
## total.sulfur.dioxide   -0.19
## density                -0.17
## pH                     -0.06
## sulphates               0.25
## alcohol                 0.48
## quality                 1.00

It seems most of the attributes are not highly correlated with each other, and in particular quality is not highly correlated with any other attribute. But some attributes are correlated with each other, for instance volatile.acidity and citric.acid, or fixed.acidity and pH. We analyze the correlation between quality and the other attributes. We first evaluate the two variables most correlated with quality: alcohol and volatile.acidity.
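To pick which attributes to examine first, the quality column of the correlation matrix above can be sorted by absolute value. The sketch below types in the rounded values from that column; with the full data frame the same vector could be taken from the quality column of `cor()` applied to the data, which is an assumption about how the matrix was produced.

```r
# Correlation of each attribute with quality, copied from the
# "quality" column of the matrix above (rounded to two decimals)
quality_cor <- c(
  fixed.acidity = 0.12, volatile.acidity = -0.39, citric.acid = 0.23,
  residual.sugar = 0.01, chlorides = -0.13, free.sulfur.dioxide = -0.05,
  total.sulfur.dioxide = -0.19, density = -0.17, pH = -0.06,
  sulphates = 0.25, alcohol = 0.48
)

# Rank by strength of the linear association, ignoring the sign
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
ranked

# The eight attributes most correlated with quality
top8 <- head(names(ranked), 8)
top8
```

The first two names are alcohol and volatile.acidity, followed by sulphates, citric.acid, total.sulfur.dioxide, density, chlorides and fixed.acidity.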

## 
##  Pearson's product-moment correlation
## 
## data:  redwine$alcohol and redwine$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

## 
##  Pearson's product-moment correlation
## 
## data:  redwine$volatile.acidity and redwine$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

It seems quality increases as alcohol increases and as volatile.acidity decreases. But the correlation between quality and either alcohol or volatile.acidity is quite weak: 0.4761663 between alcohol and quality, and -0.3905578 between volatile acidity and quality.

So we explore the six next most correlated attributes (sulphates, citric.acid, fixed.acidity, chlorides, total.sulfur.dioxide and density) to check whether any of them has a strong correlation with quality.
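The t statistic and confidence interval in the alcohol test output above follow directly from the correlation and the sample size. The sketch below reproduces them by hand using the standard formulas for a Pearson correlation test (t with n - 2 degrees of freedom, and a Fisher z-transform interval); only r and n are taken from the report.

```r
r <- 0.4761663  # correlation between alcohol and quality, from the output
n <- 1599       # number of wines

# t statistic for H0: true correlation is 0
df <- n - 2
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
t_stat

# 95% confidence interval via the Fisher z-transform
z <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
ci <- tanh(c(z - half_width, z + half_width))
ci
```

With 1597 degrees of freedom even a moderate correlation gives a very small p-value, so statistical significance here says little about how strong the relation is.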

## 
##  Pearson's product-moment correlation
## 
## data:  redwine$sulphates and redwine$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

## 
##  Pearson's product-moment correlation
## 
## data:  redwine$citric.acid and redwine$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

## 
##  Pearson's product-moment correlation
## 
## data:  redwine$fixed.acidity and redwine$quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516

## 
##  Pearson's product-moment correlation
## 
## data:  redwine$chlorides and redwine$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066

## 
##  Pearson's product-moment correlation
## 
## data:  redwine$total.sulfur.dioxide and redwine$quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003

## 
##  Pearson's product-moment correlation
## 
## data:  redwine$density and redwine$quality
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192

None of these attributes is as strongly correlated with quality as alcohol and volatile.acidity: the highest correlation among them is 0.2513971, between sulphates and quality.

Then we try to find the linear relation between quality and any of the attributes such as alcohol, volatile.acidity and so on, using linear regression.

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = redwine)
## m2: lm(formula = I(quality) ~ I(volatile.acidity), data = redwine)
## m3: lm(formula = I(quality) ~ I(sulphates), data = redwine)
## m4: lm(formula = I(quality) ~ I(citric.acid), data = redwine)
## m5: lm(formula = I(quality) ~ I(fixed.acidity), data = redwine)
## m6: lm(formula = I(quality) ~ I(chlorides), data = redwine)
## m7: lm(formula = I(quality) ~ I(total.sulfur.dioxide), data = redwine)
## m8: lm(formula = I(quality) ~ I(density), data = redwine)
## 
## ====================================================================================================================
##                               m1         m2         m3         m4         m5         m6         m7         m8       
## --------------------------------------------------------------------------------------------------------------------
##   (Intercept)              1.875***   6.566***   4.848***   5.382***   5.157***   5.829***   5.847***   80.239***   
##                           (0.175)    (0.058)    (0.078)    (0.034)    (0.098)    (0.042)    (0.034)    (10.508)     
##   I(alcohol)               0.361***                                                                                 
##                           (0.017)                                                                                   
##   I(volatile.acidity)                -1.761***                                                                      
##                                      (0.104)                                                                        
##   I(sulphates)                                   1.198***                                                           
##                                                 (0.115)                                                             
##   I(citric.acid)                                            0.938***                                                
##                                                            (0.101)                                                  
##   I(fixed.acidity)                                                     0.058***                                     
##                                                                       (0.012)                                       
##   I(chlorides)                                                                   -2.212***                          
##                                                                                  (0.426)                            
##   I(total.sulfur.dioxide)                                                                   -0.005***               
##                                                                                             (0.001)                 
##   I(density)                                                                                           -74.846***   
##                                                                                                        (10.542)     
## --------------------------------------------------------------------------------------------------------------------
##   R-squared                    0.2        0.2        0.1        0.1        0.0        0.0        0.0         0.0    
##   adj. R-squared               0.2        0.2        0.1        0.1        0.0        0.0        0.0         0.0    
##   sigma                        0.7        0.7        0.8        0.8        0.8        0.8        0.8         0.8    
##   F                          468.3      287.4      107.7       86.3       25.0       27.0       56.7        50.4    
##   p                            0.0        0.0        0.0        0.0        0.0        0.0        0.0         0.0    
##   Log-likelihood           -1721.1    -1794.3    -1874.4    -1884.6    -1914.2    -1913.2    -1898.8     -1901.8    
##   Deviance                   805.9      883.2      976.3      988.8     1026.1     1024.8     1006.5      1010.3    
##   AIC                       3448.1     3594.6     3754.9     3775.2     3834.5     3832.5     3803.5      3809.6    
##   BIC                       3464.2     3610.8     3771.0     3791.3     3850.6     3848.6     3819.7      3825.7    
##   N                         1599       1599       1599       1599       1599       1599       1599        1599      
## ====================================================================================================================

It seems that all of the attributes are only weakly correlated with quality. A single linear term explains little of the variance, so the relation between any one of these attributes and quality may be non-linear, or quality may depend on several attributes together. Based on the R^2 values, alcohol and volatile.acidity have the highest linear contribution to the quality score, but each explains at most around 20 percent of the variance in quality.
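For a regression with a single predictor, R^2 equals the square of the Pearson correlation, so the R-squared row of the table can be cross-checked against the correlation tests above. The sketch below does this with the correlations reported earlier; the vector name is illustrative.

```r
# Pearson correlations with quality, copied from the test outputs above
r <- c(
  alcohol = 0.4761663, volatile.acidity = -0.3905578,
  sulphates = 0.2513971, citric.acid = 0.2263725,
  fixed.acidity = 0.1240516, chlorides = -0.1289066,
  total.sulfur.dioxide = -0.1851003, density = -0.1749192
)

# In simple linear regression, R^2 is the squared correlation
r_squared <- r^2
round(r_squared, 3)

# Rounded to one decimal, as in the regression table
round(r_squared, 1)
```

At one decimal these match the table row for m1 to m8. At three decimals alcohol explains about 0.227 of the variance and volatile.acidity about 0.153, which the table's rounding hides.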

Next we check these eight attributes and see their variation with quality.

For the first four attributes (alcohol, volatile.acidity, sulphates and citric.acid), we see that quality increases or decreases as the attribute increases. For volatile.acidity, the variation seems to decrease as the quality increases. For the other attributes, the change in variation is not obvious.

We explore other highly correlated attributes: First we analyze pH and fixed.acidity.

The fixed.acidity and pH are strongly correlated. The value of fixed.acidity decreases as the pH value increases. This makes sense: the more acid there is in the wine, the lower the pH value, and vice versa.

Next, we explore the other highly correlated pairs, such as density and fixed.acidity.

These diagrams show the other four closely correlated relationships between attributes in red wine. What we found from them is summarized in the Bivariate Analysis below.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The quality of red wine is not strongly correlated with any single variable. We first tested the two attributes most correlated with quality, alcohol and volatile acidity, and found that neither is highly correlated with quality. The relationship between quality and alcohol or volatile.acidity seems non-linear. Based on the R^2 values, alcohol or volatile.acidity alone explains at most about 20 percent of the variance in quality.

Then we tested the other six features that still seem to correlate with quality, but none of them has a strong correlation with the quality attribute. Based on the R^2 values, sulphates explains the most variance among these features, at about 10 percent. It seems none of the red wine attributes is linearly correlated with quality on its own, so we cannot predict the quality of red wine well from any single feature. In the next section we explore combinations of these attributes to find a linear relation with quality.

The variation of these attributes with quality is not strong either. But we find that the variation of volatile acidity decreases as the quality value increases.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The fixed.acidity and pH seem closely correlated with each other. The fixed.acidity decreases as the pH increases. This makes sense: the more acid there is in the wine, the lower the pH value, and vice versa. There are also four other closely correlated relationships between attributes in red wine. From these pictures we found that:

  • Fixed acidity tends to increase with the density of the wine. This suggests that the fixed acids are a constant part of what makes up the wine, so wines with more of the other ingredients also have more fixed acid.

  • Fixed acidity tends to increase with citric acid. Maybe these two ingredients are added to the red wine together.

  • Free sulfur dioxide tends to increase with total sulfur dioxide. Free sulfur dioxide is a component of total sulfur dioxide, so an increase in free sulfur dioxide raises the total amount of sulfur dioxide.

  • Citric acid tends to decrease as volatile acidity increases. It seems these two do not coexist at high levels in red wine: when one increases, the other falls.

What was the strongest relationship you found?

The strongest relationship is between fixed.acidity and pH, with a correlation of around -0.68. The fixed.acidity is also closely correlated with density (about 0.67), but not quite as strongly as with pH.
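The strongest pair can also be located programmatically by masking the diagonal of the correlation matrix and searching for the largest absolute value. The sketch below uses a four-variable excerpt of the matrix shown in the Bivariate Plots Section (values typed in by hand); on the full matrix the same steps would apply.

```r
# Excerpt of the correlation matrix from the Bivariate Plots Section
vars <- c("fixed.acidity", "citric.acid", "density", "pH")
cm <- matrix(c(
   1.00,  0.67,  0.67, -0.68,
   0.67,  1.00,  0.36, -0.54,
   0.67,  0.36,  1.00, -0.34,
  -0.68, -0.54, -0.34,  1.00
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Ignore the diagonal, where every variable correlates 1 with itself
off_diag <- cm
diag(off_diag) <- 0

# Position of the largest absolute correlation (first of the two
# symmetric matches)
idx <- which(abs(off_diag) == max(abs(off_diag)), arr.ind = TRUE)[1, ]
strongest <- c(rownames(cm)[idx["row"]], colnames(cm)[idx["col"]])
strongest
cm[strongest[1], strongest[2]]
```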

Multivariate Plots Section

We first explore the density plots of the two attributes most correlated with quality, alcohol and volatile acidity, for different quality values.

From the density plots, we see that better-quality red wines tend to occur more often at high alcohol levels, while worse-quality red wines tend to occur more often at high volatile acidity levels.

In the last section, we found that alcohol has the strongest correlation with the quality of wine. So here we are interested in exploring which attributes, together with alcohol, are associated with better quality of red wine. We select three attributes (volatile.acidity, pH and free.sulfur.dioxide) to check whether any combination of one of them with alcohol goes with better quality of red wine.

The first diagram indicates that better-quality red wine tends to have low pH and high alcohol.

The second diagram indicates that better-quality red wine tends to have low volatile acidity and high alcohol.

The last diagram indicates that better red wine tends to have low free sulfur dioxide and high alcohol. All three diagrams show that a combination of two attributes, one of them being alcohol, is associated with better red wines.

We also explore other correlated attributes and their contributions to the quality of red wine.

From these diagrams, we found that the quality of red wine is closely related to a few groups of attributes.

Finally, we explore the linear relation between quality and the 8 most correlated attributes.
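A numeric companion to the density plots is a per-quality summary of the two attributes. The sketch below computes group medians with `aggregate()` on the first ten rows shown in the structure output. Ten rows are far too few to show the trend, so this only illustrates the mechanics; the data frame name `sample_rows` is an assumption.

```r
# First ten rows of the three columns, from the str() output above
sample_rows <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Median alcohol and volatile acidity within each quality score
by_quality <- aggregate(
  cbind(alcohol, volatile.acidity) ~ quality,
  data = sample_rows,
  FUN = median
)
by_quality
```

On the full dataset, medians that rise with quality for alcohol and fall for volatile.acidity would agree with what the density plots suggest.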
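Nested models of this kind can be built by starting from one predictor and adding one more at a time with `update()`. The sketch below shows the first two steps on the ten rows from the structure output, so its numbers will not match the full-data results reported next. The names `sample_rows`, `m1_sketch` and `m2_sketch` are illustrative, and the table layout in the report comes from a tool not shown here.

```r
# First ten rows of the three columns, from the str() output above
sample_rows <- data.frame(
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# Start with alcohol only, then add volatile.acidity to the same model
m1_sketch <- lm(quality ~ alcohol, data = sample_rows)
m2_sketch <- update(m1_sketch, ~ . + volatile.acidity)

# R^2 can only stay equal or grow when a predictor is added
r2 <- c(
  m1 = summary(m1_sketch)$r.squared,
  m2 = summary(m2_sketch)$r.squared
)
r2

# AIC penalizes the extra parameter, so it is a fairer comparison
AIC(m1_sketch, m2_sketch)
```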

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = redwine)
## m2: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity, data = redwine)
## m3: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates, 
##     data = redwine)
## m4: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates + 
##     citric.acid, data = redwine)
## m5: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates + 
##     citric.acid + fixed.acidity, data = redwine)
## m6: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates + 
##     citric.acid + fixed.acidity + chlorides, data = redwine)
## m7: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates + 
##     citric.acid + fixed.acidity + chlorides + total.sulfur.dioxide, 
##     data = redwine)
## m8: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates + 
##     citric.acid + fixed.acidity + chlorides + total.sulfur.dioxide + 
##     density, data = redwine)
## 
## =================================================================================================================
##                            m1         m2         m3         m4         m5         m6         m7         m8       
## -----------------------------------------------------------------------------------------------------------------
##   (Intercept)           1.875***   3.095***   2.611***   2.646***   2.202***   2.363***   2.652***   28.165      
##                        (0.175)    (0.184)    (0.196)    (0.201)    (0.224)    (0.228)    (0.240)    (15.083)     
##   I(alcohol)            0.361***   0.314***   0.309***   0.309***   0.320***   0.304***   0.288***    0.268***   
##                        (0.017)    (0.016)    (0.016)    (0.016)    (0.016)    (0.017)    (0.017)     (0.021)     
##   volatile.acidity                -1.384***  -1.221***  -1.265***  -1.343***  -1.239***  -1.173***   -1.137***   
##                                   (0.095)    (0.097)    (0.113)    (0.113)    (0.117)    (0.118)     (0.120)     
##   sulphates                                   0.679***   0.696***   0.701***   0.851***   0.888***    0.916***   
##                                              (0.101)    (0.103)    (0.103)    (0.111)    (0.111)     (0.112)     
##   citric.acid                                           -0.079     -0.469***  -0.335*    -0.203      -0.198      
##                                                         (0.104)    (0.137)    (0.141)    (0.145)     (0.145)     
##   fixed.acidity                                                     0.057***   0.050***   0.037**     0.055**    
##                                                                    (0.013)    (0.013)    (0.014)     (0.017)     
##   chlorides                                                                   -1.430***  -1.576***   -1.584***   
##                                                                               (0.408)    (0.408)     (0.408)     
##   total.sulfur.dioxide                                                                   -0.002***   -0.002***   
##                                                                                          (0.001)     (0.001)     
##   density                                                                                           -25.583      
##                                                                                                     (15.122)     
## -----------------------------------------------------------------------------------------------------------------
##   R-squared                 0.2        0.3        0.3        0.3        0.3        0.3        0.4         0.4    
##   adj. R-squared            0.2        0.3        0.3        0.3        0.3        0.3        0.4         0.4    
##   sigma                     0.7        0.7        0.7        0.7        0.7        0.7        0.7         0.6    
##   F                       468.3      370.4      268.9      201.8      167.0      142.2      124.9       109.8    
##   p                         0.0        0.0        0.0        0.0        0.0        0.0        0.0         0.0    
##   Log-likelihood        -1721.1    -1621.8    -1599.4    -1599.1    -1589.6    -1583.5    -1576.5     -1575.1    
##   Deviance                805.9      711.8      692.1      691.9      683.7      678.5      672.6       671.4    
##   AIC                    3448.1     3251.6     3208.8     3210.2     3193.3     3183.0     3171.1      3170.2    
##   BIC                    3464.2     3273.1     3235.7     3242.4     3230.9     3226.0     3219.5      3224.0    
##   N                      1599       1599       1599       1599       1599       1599       1599        1599      
## =================================================================================================================

Even with all eight correlated variables included, the R^2 value shows that the model explains only about 40 percent of the variance in red wine quality. The linear relationship between this combination of attributes and quality therefore remains weak, even as more attributes are added.
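The table above rounds R-squared to one decimal, which hides the differences between adjacent models. In the table, AIC is lowest for m8 while BIC is lowest for m7, so the two criteria disagree on whether the density term is worth keeping. The following is a minimal sketch for comparing the fits with unrounded statistics. `compare_fits` is a hypothetical helper, not part of the original analysis, and the usage lines assume the fitted objects `m1` to `m8` are still in the workspace.

```r
# Hypothetical helper: summarise a named list of fitted lm objects
# with unrounded fit statistics.
compare_fits <- function(models) {
  data.frame(
    model         = names(models),
    r.squared     = sapply(models, function(m) summary(m)$r.squared),
    adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
    AIC           = sapply(models, AIC),
    BIC           = sapply(models, BIC),
    row.names     = NULL
  )
}

# Usage (assumes m1 ... m8 are the fitted models from the table above):
# compare_fits(list(m1 = m1, m2 = m2, m7 = m7, m8 = m8))
# anova(m7, m8)  # F-test for whether adding density improves the fit
```

Because m7 is nested in m8, `anova(m7, m8)` gives a direct test of the one added term rather than relying on the rounded R-squared values.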

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

We find a few correlated features that strengthen each other.

  • Density and fixed acidity increase together. Higher values of both attributes are associated with better red wine quality.

  • As fixed acidity increases, citric acid also increases. Higher values of both attributes are associated with higher red wine quality.

  • As free sulfur dioxide increases, total sulfur dioxide also increases. Higher values of both attributes are associated with higher red wine quality. Because free and total sulfur dioxide are closely correlated, a change in either one is accompanied by a change in quality.

  • As volatile acidity decreases, citric acid increases. Low volatile acidity combined with high citric acid is associated with better red wine quality.

Were there any interesting or surprising interactions between features?

We selected the two attributes most correlated with the quality score: alcohol and volatile acidity. Together they explain about 30 percent of the variance in quality. The other attributes contribute little. The results also suggest that better red wines tend to have high alcohol and low volatile acidity. This finding may help wine producers make good-quality red wine. It may also help me pick a good red wine at the liquor store by checking these two values.
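The direction of each effect can be read from the m2 column of the regression table: the alcohol coefficient is positive (0.314) and the volatile acidity coefficient is negative (-1.384). As a rough sketch, the fitted m2 equation can be evaluated by hand. The function name and the two example wines below are hypothetical and chosen only for illustration; the coefficients are the rounded values from the table.

```r
# Fitted m2 equation, using the rounded coefficients from the table:
# quality = 3.095 + 0.314 * alcohol - 1.384 * volatile.acidity
predict_quality_m2 <- function(alcohol, volatile.acidity) {
  3.095 + 0.314 * alcohol - 1.384 * volatile.acidity
}

# Two hypothetical wines (illustrative values, not rows from the data set)
predict_quality_m2(alcohol = 12.0, volatile.acidity = 0.4)  # about 6.31
predict_quality_m2(alcohol = 9.5,  volatile.acidity = 0.8)  # about 4.97
```

With the fitted objects available, `predict(m2, newdata = ...)` would give the same result from the unrounded coefficients.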


Final Plots and Summary

Plot One

Description One

We are interested in the relationship between quality and volatile acidity, since these two variables are strongly correlated. The boxplot and the accompanying density plot show that volatile acidity strongly affects the quality of red wine: as volatile acidity increases, the quality score decreases. This implies that a lower level of volatile acidity may lead to better red wine.

Plot Two

Description Two

Alcohol has the highest correlation with the quality score, so we use a scatter plot to show the relationship between quality and alcohol content. We found that alcohol content strongly affects the quality of red wine: wines with more alcohol tend to receive higher quality scores. This implies that red wines with high alcohol content tend to be of better quality.

Plot Three

Description Three

We have found that the quality of red wine is closely related to alcohol and to volatile acidity individually. The correlation tests, the linear regression, and the previous two plots (alcohol vs. quality and volatile acidity vs. quality) all show that these two attributes are the most strongly correlated with red wine quality. We are also interested in whether the combination of these two variables is associated with better quality, so here we plot the two attributes together to see how quality relates to both at once. We found that higher alcohol together with lower volatile acidity is associated with better red wine quality. This implies that, to pick a good red wine, we should choose one with low volatile acidity and high alcohol content.
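The combined effect described above can also be checked numerically. The sketch below splits the wines at the median of each attribute and reports the mean quality in each of the four groups. `mean_quality_by_halves` is a hypothetical helper, the median split is an illustrative choice rather than the method used for the plot, and the function assumes a data frame with `alcohol`, `volatile.acidity`, and `quality` columns, as in `redwine`.

```r
# Hypothetical helper: mean quality for wines above/below the median
# of alcohol and of volatile acidity (a simple 2 x 2 summary).
mean_quality_by_halves <- function(df) {
  alcohol_level <- ifelse(df$alcohol > median(df$alcohol),
                          "high alcohol", "low alcohol")
  va_level <- ifelse(df$volatile.acidity > median(df$volatile.acidity),
                     "high volatile acidity", "low volatile acidity")
  # Rows: alcohol level; columns: volatile acidity level
  tapply(df$quality, list(alcohol_level, va_level), mean)
}

# Usage (assumes the redwine data frame from the report):
# mean_quality_by_halves(redwine)
```

If the pattern in the plot holds, the "high alcohol" / "low volatile acidity" cell should show the highest mean quality.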


Reflection

In this report, we explore the quality of red wine and its relationship with the other features of the wine. The analysis is organized into three sections.

We find a few interesting results on red wine quality and its chemical attributes.

  1. Alcohol and volatile acidity together are closely correlated with red wine quality. Based on the R^2 value, the combination of alcohol and volatile acidity explains about 30 percent of the variance in quality. Higher alcohol and lower volatile acidity are associated with better red wine quality.

  2. The other features of red wine are only weakly correlated with the quality score. The combination of the top 8 attributes, including alcohol and volatile acidity, explains only about 40 percent of the variance in quality.

  3. Some attributes are correlated with each other. Free sulfur dioxide and total sulfur dioxide are positively correlated: as free sulfur dioxide increases, total sulfur dioxide increases. pH and fixed acidity are negatively correlated: as fixed acidity decreases, pH increases.

  4. Most quality scores are 5 or 6. There are no high scores such as 9 or 10, and no low scores such as 0, 1, or 2.
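The pairwise correlations in item 3 can be checked with a Pearson correlation test. This is a minimal sketch: `check_correlation` is a hypothetical helper, and the column names `free.sulfur.dioxide` and `pH` in the usage lines are assumptions based on the naming of the other columns in `redwine`.

```r
# Hypothetical helper: Pearson correlation and p-value for two columns.
check_correlation <- function(df, x, y) {
  test <- cor.test(df[[x]], df[[y]], method = "pearson")
  c(r = unname(test$estimate), p.value = test$p.value)
}

# Usage (column names for free sulfur dioxide and pH are assumed):
# check_correlation(redwine, "free.sulfur.dioxide", "total.sulfur.dioxide")
# check_correlation(redwine, "pH", "fixed.acidity")
# table(redwine$quality)  # distribution of quality scores (item 4)
```

A positive `r` for the first pair and a negative `r` for the second would match the description above.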

Success

We found the two features most correlated with red wine quality: alcohol and volatile acidity. These correlations give us some hints on how to pick a good-quality red wine in the market.

Difficulties

One difficulty in analyzing the data is our unfamiliarity with the measurements of red wine. Several of the chemical terms are unfamiliar to non-specialists in chemistry. Although we could identify correlations between some attributes, we cannot explain the underlying reason for the correlation between two attributes.

The correlations between the quality score and other attributes such as alcohol are not very strong. We have to combine several attributes to find a possible relationship between red wine quality and these attributes.

Future Work

There are a few ways to improve the analysis of red wine quality in the future. First, the data set contains only 1599 records of red wine quality tests. Including more tests would support a better analysis. Second, the range of quality scores is very limited: most scores are 5, 6, or 7, and none are 0, 1, 2, 9, or 10. With so few distinct score levels, it is hard to distinguish the quality of the wines. The score could instead range from 0 to 100, or allow fractional values. With more records and a wider range of quality scores, we could get a better analysis of red wine quality.